library(ggplot2)
data(diamonds)
theme_set(theme_minimal(20))
dim(diamonds)
## [1] 53940 10
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
## carat cut color clarity
## Min. :0.200 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.400 Good : 4906 E: 9797 VS2 :12258
## Median :0.700 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.798 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.040 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.010 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.0 Min. :43.0 Min. : 326 Min. : 0.00
## 1st Qu.:61.0 1st Qu.:56.0 1st Qu.: 950 1st Qu.: 4.71
## Median :61.8 Median :57.0 Median : 2401 Median : 5.70
## Mean :61.8 Mean :57.5 Mean : 3933 Mean : 5.73
## 3rd Qu.:62.5 3rd Qu.:59.0 3rd Qu.: 5324 3rd Qu.: 6.54
## Max. :79.0 Max. :95.0 Max. :18823 Max. :10.74
##
## y z
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 4.72 1st Qu.: 2.91
## Median : 5.71 Median : 3.53
## Mean : 5.73 Mean : 3.54
## 3rd Qu.: 6.54 3rd Qu.: 4.04
## Max. :58.90 Max. :31.80
##
Most diamonds are of ideal cut. The median carat size is 0.7. Most diamonds have a color of G or better. About 75% of diamonds have carat weights less than 1. The median price for a diamonds $2401 and the max price is $18,823.
I'm going to look at individual variables first to get a sense of the individual variables within the data set and make notes of things I want to continue to explore.
qplot(price, data = diamonds)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(price, data = diamonds, binwidth = 0.01) + scale_x_log10(breaks = c(100,
500, 1000, 1500, 2000, 5000, 10000, 15000))
## Warning: position_stack requires constant width: output may be incorrect
Transformed the long tail data to better understand the distribution of price. The tranformed price distribution appears bimodal with the price peaking around 800 or so and again at 5000 or so. Why is there a gap at 1500? Are there really no diamonds with that price? I wonder what this plot looks like across the categorical variables of cut, color, and clarity.
qplot(carat, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(0,
2))
qplot(carat, data = diamonds, binwidth = 0.01) + coord_cartesian(xlim = c(0,
2))
## Warning: position_stack requires constant width: output may be incorrect
Some carat weights occur more often than other carat weights. I wonder how carat is connected to price, and I wonder if the carat values are specific to certain cuts of diamonds.
For now, I'm going to see which carat weights are most common.
sort(table(diamonds$carat), decreasing = T)
##
## 0.3 0.31 1.01 0.7 0.32 1 0.9 0.41 0.4 0.71 0.5 0.33 0.51 0.34 1.02
## 2604 2249 2242 1981 1840 1558 1485 1382 1299 1294 1258 1189 1127 910 883
## 0.52 1.51 1.5 0.72 0.53 0.42 0.38 0.35 1.2 0.54 0.36 0.91 1.03 0.55 0.56
## 817 807 793 764 709 706 670 667 645 625 572 570 523 496 492
## 0.73 0.43 1.04 1.21 2.01 0.57 0.39 0.37 1.52 1.06 1.05 1.07 0.74 0.58 1.11
## 492 488 475 473 440 430 398 394 381 373 361 342 322 310 308
## 1.22 0.23 1.09 0.8 0.59 1.23 1.1 2 0.24 0.26 0.76 0.77 1.12 0.75 1.08
## 300 293 287 284 282 279 278 265 254 253 251 251 251 249 246
## 1.13 1.24 0.27 0.6 0.92 1.53 1.7 0.25 0.44 1.14 0.61 0.81 0.28 0.78 1.25
## 246 236 233 228 226 220 215 212 212 207 204 200 198 187 187
## 0.46 2.02 1.54 1.16 0.79 1.15 1.26 0.93 0.82 0.62 1.27 1.31 0.83 0.29 1.19
## 178 177 174 172 155 149 146 142 140 135 134 133 131 130 126
## 1.55 1.18 1.3 2.03 1.71 0.45 1.17 1.56 1.28 1.57 0.96 0.63 1.29 0.47 1.6
## 124 123 122 122 119 110 110 109 106 106 103 102 101 99 95
## 1.32 1.58 1.59 1.33 2.04 0.64 1.35 1.34 2.05 0.65 0.95 0.84 1.61 0.48 0.85
## 89 89 89 87 86 80 77 68 67 65 65 64 64 63 62
## 1.62 2.06 0.94 0.97 1.72 1.73 2.1 1.36 1.4 1.63 1.75 2.07 0.66 0.67 2.14
## 61 60 59 59 57 52 52 50 50 50 50 50 48 48 48
## 1.37 0.49 2.09 1.64 2.11 2.08 1.41 1.74 1.39 0.86 1.65 2.2 0.87 0.98 2.18
## 46 45 45 43 43 41 40 40 36 34 32 32 31 31 31
## 1.66 1.76 2.22 0.69 1.38 0.68 1.42 1.67 2.12 2.16 1.69 0.88 0.99 2.21 2.15
## 30 28 27 26 26 25 25 25 25 25 24 23 23 23 22
## 2.19 0.89 1.47 1.8 2.13 2.3 2.28 1.43 1.68 1.44 1.46 1.83 2.17 2.25 1.77
## 22 21 21 21 21 21 20 19 19 18 18 18 18 18 17
## 2.29 2.5 2.51 2.24 2.32 1.45 1.79 2.26 3.01 1.82 2.23 2.31 2.4 0.2 1.78
## 17 17 17 16 16 15 15 15 14 13 13 13 13 12 12
## 1.91 2.27 1.49 0.21 1.81 1.86 2.33 2.48 2.52 2.54 2.36 2.38 2.42 2.53 3
## 12 12 11 9 9 9 9 9 9 9 8 8 8 8 8
## 1.48 1.87 1.9 2.35 2.39 1.93 2.37 2.43 0.22 1.98 2.34 2.41 1.84 1.88 1.89
## 7 7 7 7 7 6 6 6 5 5 5 5 4 4 4
## 1.96 1.97 2.44 2.45 1.85 1.94 1.95 1.99 2.46 2.47 2.49 2.55 2.56 2.57 2.58
## 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3
## 2.6 2.61 2.63 2.66 2.72 2.74 1.92 2.68 2.75 2.8 3.04 4.01 2.59 2.64 2.65
## 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1
## 2.67 2.7 2.71 2.77 3.02 3.05 3.11 3.22 3.24 3.4 3.5 3.51 3.65 3.67 4
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4.13 4.5 5.01
## 1 1 1
qplot(depth, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(55,
70))
## Warning: position_stack requires constant width: output may be incorrect
summary(diamonds$depth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.0 61.0 61.8 61.7 62.5 79.0
Most diamonds have a depth between 60 mm and 65 mm: median 61.8 mm and mean 61.75 mm.
qplot(table, data = diamonds) + coord_cartesian(xlim = c(50, 70))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(table, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(50,
70))
## Warning: position_stack requires constant width: output may be incorrect
summary(diamonds$table)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.0 56.0 57.0 57.5 59.0 95.0
Setting the binwidth indicates that most table values are integers. Most diamonds have a table between 55 mm and 60 mm.
sort(table(diamonds$table), decreasing = T)
##
## 56 57 58 59 55 60 54 61 62 63 53 64 65 66 52
## 9881 9724 8369 6572 6268 4241 2594 2282 1273 588 567 260 146 91 56
## 67 54.1 55.1 53.9 54.2 54.4 53.7 54.5 54.8 53.8 55.6 54.7 68 53.6 56.4
## 42 30 30 28 28 28 25 24 24 22 22 21 21 20 20
## 55.7 55.8 55.2 54.3 55.9 56.2 54.9 56.1 54.6 56.3 55.3 55.4 55.5 53.4 53.5
## 19 19 18 17 17 17 16 16 15 15 13 13 13 12 12
## 56.5 56.6 56.7 56.9 57.1 57.2 57.8 58.1 57.7 58.5 59.9 60.1 60.3 60.5 51
## 11 11 11 11 11 11 11 11 10 10 10 10 10 10 9
## 57.4 57.6 60.7 69 70 53.3 57.5 59.4 60.9 56.8 58.6 59.2 61.2 61.9 57.3
## 9 9 9 9 9 8 8 8 8 7 7 7 7 7 6
## 57.9 59.7 59.8 61.5 53.2 58.8 59.1 60.2 60.4 60.8 58.2 58.4 59.5 62.2 62.5
## 6 6 6 6 5 5 5 5 5 5 4 4 4 4 4
## 73 53.1 58.3 58.9 59.6 60.6 61.4 49 50 52.8 58.7 59.3 61.1 61.7 62.3
## 4 3 3 3 3 3 3 2 2 2 2 2 2 2 2
## 62.6 62.8 43 44 50.1 51.6 52.4 61.3 61.6 61.8 62.1 62.4 63.3 63.4 63.5
## 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## 64.2 64.3 65.4 71 76 79 95
## 1 1 1 1 1 1 1
Again, I wonder if this has anything to do with the cut of a diamond. Cut is the quality of a diamons may influence carat weight and is responsible for making a diamond sparkle. There's likely to be strong relationships among carat, table, cut, and price.
qplot(x, data = diamonds, binwidth = 0.1)
qplot(x, data = diamonds, geom = "freqpoly", binwidth = 0.1)
Most diamonds have an x dimension between 4 mm and 7 mm.
qplot(y, data = diamonds, binwidth = 0.1)
## Warning: position_stack requires constant width: output may be incorrect
qplot(y, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
## Warning: position_stack requires constant width: output may be incorrect
Again, most diamonds have a y dimension between 4 mm and 7 mm. There are some outliers for the y dimension.
qplot(z, data = diamonds, binwidth = 0.1)
## Warning: position_stack requires constant width: output may be incorrect
qplot(z, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(0, 7))
## Warning: position_stack requires constant width: output may be incorrect
Most diamonds have a z dimension between 2 mm and 6 mm. There are some outliers for the z dimension too.
qplot(carat, data = diamonds, binwidth = 0.01, fill = cut) + coord_cartesian(xlim = c(0,
2)) + guides(fill = guide_legend(reverse = T))
## Warning: position_stack requires constant width: output may be incorrect
by(carat, cut, summary)
## Error: object 'carat' not found
It doesn't look like particular cuts have a certain number of carats. It looks like most of the ideal cut diamonds are less than one carat. I'm going to look at those values to be sure.
sort(table(subset(diamonds, cut == "Ideal")$carat), decreasing = T)
##
## 0.3 0.31 0.32 0.33 0.41 0.7 0.4 0.51 0.34 0.71 0.52 0.53 1.01 0.5 0.42
## 1247 1209 1066 673 667 560 545 525 508 499 459 429 426 388 387
## 0.38 0.35 0.54 0.72 0.56 0.36 0.55 1.02 0.73 0.57 1.03 0.43 0.9 1 1.2
## 378 374 372 366 313 309 302 272 243 239 225 221 215 208 194
## 1.04 0.58 1.51 0.39 1.06 0.37 1.21 1.07 1.05 1.09 0.59 0.74 1.11 1.23 0.76
## 187 182 182 181 174 172 160 156 147 141 139 135 130 127 122
## 1.08 1.22 1.1 1.5 0.27 1.13 1.52 0.6 0.8 0.91 0.26 0.61 1.12 0.44 1.24
## 121 121 118 117 113 113 109 108 108 108 106 105 103 96 96
## 0.77 0.75 1.14 0.81 1.16 0.46 0.78 0.28 2.01 1.25 1.53 0.24 1.26 0.25 0.79
## 92 90 88 87 87 83 82 81 78 77 74 69 69 66 66
## 0.82 0.62 0.83 1.7 1.15 1.17 1.27 1.55 0.92 1.54 0.29 1.18 1.31 0.63 1.19
## 60 59 59 59 56 56 56 56 55 53 52 51 51 49 49
## 1.28 1.57 2.02 1.56 0.23 0.45 0.47 1.58 1.29 1.6 0.64 2 1.3 2.03 1.71
## 46 46 46 45 44 42 42 42 40 40 39 39 38 38 37
## 0.93 1.59 1.32 0.85 1.34 1.33 1.35 0.65 1.61 1.62 0.48 1.63 0.66 0.84 1.37
## 35 35 33 31 30 28 28 27 26 25 23 23 21 21 19
## 1.75 2.07 0.96 0.97 1.36 2.05 2.04 2.06 0.87 1.74 0.95 1.65 1.67 2.1 1.39
## 18 18 17 17 17 17 16 16 15 15 14 14 14 14 13
## 1.4 1.64 0.86 2.08 2.09 2.14 2.2 0.94 1.38 1.66 1.68 1.8 2.16 2.3 0.88
## 13 13 12 12 12 12 12 11 11 11 11 11 11 11 10
## 1.41 1.72 1.76 2.11 2.15 1.42 1.69 0.67 1.43 2.12 2.18 2.22 1.73 2.24 2.4
## 10 10 10 10 10 9 9 8 8 8 8 8 7 7 7
## 0.49 0.89 2.13 2.19 2.21 2.28 2.36 0.69 0.98 0.99 2.17 2.25 2.26 2.32 1.77
## 6 6 6 6 6 6 6 5 5 5 5 5 5 5 4
## 1.79 2.37 2.5 0.2 0.68 1.44 1.46 1.49 1.83 1.87 1.91 2.27 2.29 2.51 2.54
## 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3
## 1.78 1.85 1.89 1.9 1.98 2.33 2.39 2.45 2.46 2.53 2.61 2.72 3.01 1.45 1.47
## 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
## 1.48 1.82 1.84 1.86 1.92 1.93 2.34 2.41 2.42 2.43 2.47 2.48 2.49 2.52 2.56
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2.59 2.6 2.63 2.64 2.75 3.22 3.5
## 1 1 1 1 1 1 1
Most ideal cut diamonds are under 1.25 carats.
qplot(cut, data = diamonds, geom = "bar", fill = cut)
Most diamonds have ideal cut, which is almost double the amount of very good cut diamonds.
qplot(x = cut, y = price, data = diamonds, geom = "boxplot")
by(diamonds$price, diamonds$cut, summary)
## diamonds$cut: Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 2050 3280 4360 5210 18600
## --------------------------------------------------------
## diamonds$cut: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327 1140 3050 3930 5030 18800
## --------------------------------------------------------
## diamonds$cut: Very Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336 912 2650 3980 5370 18800
## --------------------------------------------------------
## diamonds$cut: Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 1050 3180 4580 6300 18800
## --------------------------------------------------------
## diamonds$cut: Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 878 1810 3460 4680 18800
Ideal diamonds have the lowest median price. This seems really unusual since I would expect diamonds with an ideal cut to have a higher median price compared to the other groups. There are many outliers. The variation in price tends to increase as cut improves and then decreases for diamonds with ideal cuts. What about price/carat for these cuts?
qplot(x = cut, y = price/carat, data = diamonds, geom = "boxplot")
qplot(x = cut, y = price/carat, data = diamonds, geom = "boxplot") + coord_cartesian(ylim = c(0,
6000))
with(diamonds, by(price/carat, cut, summary))
## cut: Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1170 2740 3450 3770 4510 10900
## --------------------------------------------------------
## cut: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1080 2390 3610 3860 4790 15900
## --------------------------------------------------------
## cut: Very Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1140 2330 3610 4010 5020 17800
## --------------------------------------------------------
## cut: Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1050 2590 3760 4220 5320 17100
## --------------------------------------------------------
## cut: Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1110 2460 3310 3920 4770 17100
Wow! Ideal diamonds have the lowest median for price per carat. The variance across the groups seems to be about the same with Fair cut diamonds having the least variation for the middle 50% of diamonds.
qplot(color, data = diamonds, geom = "bar", fill = color)
Most diamonds have have color ratings between E and H.
qplot(x = color, y = price, data = diamonds, geom = "boxplot")
by(diamonds$price, diamonds$color, summary)
## diamonds$color: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 357 911 1840 3170 4210 18700
## --------------------------------------------------------
## diamonds$color: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 882 1740 3080 4000 18700
## --------------------------------------------------------
## diamonds$color: F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342 982 2340 3720 4870 18800
## --------------------------------------------------------
## diamonds$color: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 354 931 2240 4000 6050 18800
## --------------------------------------------------------
## diamonds$color: H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 984 3460 4490 5980 18800
## --------------------------------------------------------
## diamonds$color: I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1120 3730 5090 7200 18800
## --------------------------------------------------------
## diamonds$color: J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 1860 4230 5320 7700 18700
Here is another surprise. The lowest median price diamonds have a color of D, which is the best color in the data set. Price variance increases as the color decreases (best color is D and the worst color is J). The median price typically decreases as color improves. Let's look at price per carat by color.
qplot(x = color, y = price/carat, data = diamonds, geom = "boxplot")
with(diamonds, by(price/carat, color, summary))
## color: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1130 2460 3410 3950 4750 17800
## --------------------------------------------------------
## color: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1080 2430 3250 3800 4510 14600
## --------------------------------------------------------
## color: F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1170 2590 3490 4130 4950 13900
## --------------------------------------------------------
## color: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1140 2540 3490 4160 5500 12500
## --------------------------------------------------------
## color: H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1050 2400 3820 4010 5130 10200
## --------------------------------------------------------
## color: I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1150 2340 3780 4000 5200 9400
## --------------------------------------------------------
## color: J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1080 2560 3780 3830 4930 8650
The best color diamonds (D and E) have the lowest median price. Again, this is such an unusual trend. This also seems strange since most diamonds in the data set are not of color D. Let's split up the price / carat distribution by color.
qplot(x = price/carat, data = diamonds, fill = color)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(x = price/carat, data = diamonds, fill = color) + facet_wrap(~cut)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
It looks like the diamonds with better cuts and color tend to have lower price / carat values. This provides some explanation for the odd low median price and price / carat for better cuts and colors, but I'm still not clear on this. I'm going to keep this in mind and try to explore the same plots for clarity.
qplot(clarity, data = diamonds, geom = "bar", fill = clarity)
Most diamonds have average clarity ratings. Very few diamonds have the worst or best clarity rating, like the rating pattern for color.
qplot(x = clarity, y = price, data = diamonds, geom = "boxplot")
by(diamonds$price, diamonds$clarity, summary)
## diamonds$clarity: I1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 345 2080 3340 3920 5160 18500
## --------------------------------------------------------
## diamonds$clarity: SI2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 2260 4070 5060 5780 18800
## --------------------------------------------------------
## diamonds$clarity: SI1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 1090 2820 4000 5250 18800
## --------------------------------------------------------
## diamonds$clarity: VS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 900 2050 3920 6020 18800
## --------------------------------------------------------
## diamonds$clarity: VS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327 876 2000 3840 6020 18800
## --------------------------------------------------------
## diamonds$clarity: VVS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336 794 1310 3280 3640 18800
## --------------------------------------------------------
## diamonds$clarity: VVS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336 816 1090 2520 2380 18800
## --------------------------------------------------------
## diamonds$clarity: IF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 369 895 1080 2860 2390 18800
Here again, there is a trend that goes against my intuition. The lowest median price occurs for the best clarity (IF). There also to be many more outliers for the better clarity diamonds. I'm not sure why great clarity diamonds are price so low. Another trend to note here is that price variance increases then decreases significantly as the clarity improves.
I want to look at two things: price per clarity, and the distribution of prices for diamonds with best levels of the categorical variables.
qplot(x = clarity, y = price/carat, data = diamonds, geom = "boxplot")
with(diamonds, by(price/carat, clarity, summary))
## clarity: I1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1050 2110 2890 2800 3350 6350
## --------------------------------------------------------
## clarity: SI2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1080 3000 3950 4010 4740 9910
## --------------------------------------------------------
## clarity: SI1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1130 2360 3670 3850 4930 9690
## --------------------------------------------------------
## clarity: VS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1150 2440 3430 4080 5480 12500
## --------------------------------------------------------
## clarity: VS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1210 2410 3450 4160 5490 12400
## --------------------------------------------------------
## clarity: VVS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1340 2450 3170 4200 4940 13400
## --------------------------------------------------------
## clarity: VVS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1400 2540 2980 3850 4060 14500
## --------------------------------------------------------
## clarity: IF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1590 2860 3160 4260 4280 17800
This plot seems more reasonable. The lowest median price per carat has clarity I1 which is the lowest clarity rating. The median increases slightly then holds relatively constant before decreasing again for the highest clarity. The variance increases then decreases across the clarity levels from worst to best.
qplot(x = price/carat, data = diamonds, fill = clarity) + guides(fill = guide_legend(reverse = T))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(x = price/carat, data = diamonds, fill = clarity) + facet_wrap(~cut)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(x = price/carat, data = diamonds, fill = clarity) + facet_wrap(~color)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The histogram and faceted histograms somewhat explain the odd trends as again there is a greater number of ideal diamonds, color D diamonds, and clarity IF diamonds in the lower price ranges. Let's look at the price distribution of the higher quality diamonds in cut, color, and clarity.
bestDiamonds <- subset(diamonds, (color == "D" | color == "E") & (clarity ==
"IF" | clarity == "VVS1") & (cut == "Ideal" | cut == "Premium"))
qplot(x = price, data = bestDiamonds)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
summary(bestDiamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 553 967 1210 2890 2640 18700
summary(bestDiamonds$price/bestDiamonds$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2170 2980 3420 4710 5020 17100
Let's compare to the worst diamonds across the same variables.
worstDiamonds <- subset(diamonds, (color == "J" | color == "I") & (clarity ==
"I1" | clarity == "SI2") & (cut == "Fair" | cut == "Good"))
qplot(x = price, data = worstDiamonds)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
summary(worstDiamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 2810 4310 5750 7560 18500
summary(worstDiamonds$price/worstDiamonds$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1080 2640 3320 3580 4280 7440
This doesn't add much to my thoughts already. I'm going create density plots that are similar to the price histograms earlier to examine the price for each level of cut, color, and clarity.
qplot(price, data = diamonds, binwidth = 0.01, color = cut, geom = "density") +
scale_x_log10(breaks = c(100, 500, 800, 1500, 5000, 10000))
qplot(price, data = diamonds, binwidth = 0.01, color = color, geom = "density") +
scale_x_log10(breaks = c(100, 500, 800, 1500, 5000, 10000))
qplot(price, data = diamonds, binwidth = 0.01, color = clarity, geom = "density") +
scale_x_log10(breaks = c(100, 500, 800, 1500, 5000, 10000)) + guides(color = guide_legend(reverse = T))
These density plots explain the odd trends that were seen in the box plots earlier. Diamonds with better levels of clarity, cut, and color tend to occur more often at lower prices while diamonds with worse levels of clarity, cut, and color tend to occur more often at higher prices. I am wondering about price / carat too.
qplot(price/carat, data = diamonds, binwidth = 0.01, color = cut, geom = "density") +
scale_x_log10(breaks = c(1000, 2000, 3500, 5000, 10000))
qplot(price/carat, data = diamonds, binwidth = 0.01, color = color, geom = "density") +
scale_x_log10(breaks = c(1000, 2000, 3500, 5000, 10000))
qplot(price/carat, data = diamonds, binwidth = 0.01, color = clarity, geom = "density") +
scale_x_log10(breaks = c(1000, 2000, 3000, 4500, 7000, 10000)) + guides(color = guide_legend(reverse = T))
These plots support the variability and trends that the boxplots showed from before. I am going see which variables correlate with price and try to work towards building a linear model to predict price.
library(lsr)
correlate(diamonds)
##
## CORRELATIONS
## ============
## - correlation type: pearson
## - correlations shown only when both variables are numeric
##
## carat cut color clarity depth table price x y z
## carat . . . . 0.028 0.182 0.922 0.975 0.952 0.953
## cut . . . . . . . . . .
## color . . . . . . . . . .
## clarity . . . . . . . . . .
## depth 0.028 . . . . -0.296 -0.011 -0.025 -0.029 0.095
## table 0.182 . . . -0.296 . 0.127 0.195 0.184 0.151
## price 0.922 . . . -0.011 0.127 . 0.884 0.865 0.861
## x 0.975 . . . -0.025 0.195 0.884 . 0.975 0.971
## y 0.952 . . . -0.029 0.184 0.865 0.975 . 0.952
## z 0.953 . . . 0.095 0.151 0.861 0.971 0.952 .
The dimensions of a diamond tend to correlate with each other. The longer one dimension, then the larger the diamond. The dimensions also correlate with carat weight which makes sense. Price correlates strongly with carat weight and the three dimensions (x, y, z).
library(GGally)
## Loading required package: reshape
## Loading required package: plyr
##
## Attaching package: 'reshape'
##
## The following objects are masked from 'package:plyr':
##
## rename, round_any
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## Loading required namespace: car
##
## Attaching package: 'memisc'
##
## The following object is masked from 'package:reshape':
##
## rename
##
## The following object is masked from 'package:plyr':
##
## rename
##
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
##
## The following object is masked from 'package:base':
##
## as.array
# sample 5,000 diamonds from the data set
set.seed(281)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 5000), ]
ggpairs(diamond_samp, params = c(shape = I("."), outlier.shape = I(".")))
I want to closer at scatter plots involving price and some other variables: carat, table, depth, and volume.
qplot(x = carat, y = price, data = diamonds)
qplot(x = carat, y = price, data = diamonds) + coord_cartesian(xlim = c(0, quantile(diamonds$carat,
0.99)))
As carat size increases, the variance in price increases. We still see vertical bands where many diamonds take on the same carat value at different price points. The relationship between price and carat appears to be exponential rather than linear.
qplot(x = table, y = price, data = diamonds)
Again, the tall vertical strips indicate table values are mostly integers. A few outliers below 50 mm and one above 90 mm.
qplot(x = table, y = price, data = diamonds, color = cut)
qplot(x = table, y = price, data = diamonds, color = cut) + facet_wrap(~clarity)
qplot(x = table, y = price, data = diamonds, color = cut) + facet_wrap(~color)
Levels of cut cluster by table value. This may make sense based on the type of cut as certain cuts produce certain dimensions. The pattern holds across each level of clarity and each level of color with the exception of the lowest clarity.
qplot(x = table, y = price, data = diamonds, color = color)
Nothing stands out in the plot above.
qplot(x = table, y = price, data = diamonds, color = clarity)
Nothing stands out in the plot above.
qplot(x = depth, y = price, data = diamonds)
ggplot(aes(x = depth, y = price), data = diamonds) + geom_point(alpha = 1/50)
First plot suffers from overplotting. Most diamonds have a depth between 60 and 65 (no units).
What about the volume of diamonds? Does it have any relationships with price and other variables in the data set? I'm going to use a rough approximation of volume by using x * y * z to approximate a diamond as if it were a rectangular prism, basically a box.
diamonds <- transform(diamonds, volume = x * y * z)
qplot(x = volume, y = price, data = diamonds)
Some diamonds have a volume of 0. Why? There's other outliers: expensive diamond near volume of 4000 and less expensive diamond priced below 1000.
table(diamonds$volume == 0)
##
## FALSE TRUE
## 53920 20
noVolume <- subset(diamonds, volume == 0)
noVolume
## carat cut color clarity depth table price x y z volume
## 2208 1.00 Premium G SI2 59.1 59 3142 6.55 6.48 0 0
## 2315 1.01 Premium H I1 58.1 59 3167 6.66 6.60 0 0
## 4792 1.10 Premium G SI2 63.0 59 3696 6.50 6.47 0 0
## 5472 1.01 Premium F SI2 59.2 58 3837 6.50 6.47 0 0
## 10168 1.50 Good G I1 64.0 61 4731 7.15 7.04 0 0
## 11183 1.07 Ideal F SI2 61.6 56 4954 0.00 6.62 0 0
## 11964 1.00 Very Good H VS2 63.3 53 5139 0.00 0.00 0 0
## 13602 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0 0
## 15952 1.14 Fair G VS1 57.5 67 6381 0.00 0.00 0 0
## 24395 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0 0
## 24521 1.56 Ideal G VS2 62.2 54 12800 0.00 0.00 0 0
## 26124 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0 0
## 26244 1.20 Premium D VVS1 62.1 59 15686 0.00 0.00 0 0
## 27113 2.20 Premium H SI1 61.2 59 17265 8.42 8.37 0 0
## 27430 2.25 Premium H SI2 62.8 59 18034 0.00 0.00 0 0
## 27504 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0 0
## 27740 2.80 Good G SI2 63.8 58 18788 8.90 8.85 0 0
## 49557 0.71 Good F SI2 64.1 60 2130 0.00 0.00 0 0
## 49558 0.71 Good F SI2 64.1 60 2130 0.00 0.00 0 0
## 51507 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0 0
summary(noVolume$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2130 3560 5350 8800 15500 18800
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2400 3930 5320 18800
table(diamonds$carat == 0)
##
## FALSE
## 53940
All dimensions (x, y, and z) are missing or the z value is 0. The diamonds in this subset tend to be very expensive or fall in the third quartile of the entire diamonds data set. Other variables such as carat, depth, table, and price are reported so I'll assume those values can be trusted.
quantile(diamonds$volume, probs = seq(0, 1, 0.01))
## 0% 1% 2% 3% 4% 5% 6% 7% 8%
## 0.00 40.08 43.94 47.69 48.66 49.18 49.61 49.97 50.32
## 9% 10% 11% 12% 13% 14% 15% 16% 17%
## 50.69 51.09 51.54 51.93 52.40 52.85 53.38 53.99 54.73
## 18% 19% 20% 21% 22% 23% 24% 25% 26%
## 55.55 56.60 57.80 59.16 60.89 62.65 63.94 65.14 66.00
## 27% 28% 29% 30% 31% 32% 33% 34% 35%
## 66.67 67.27 68.06 69.01 70.54 73.70 79.72 81.55 82.57
## 36% 37% 38% 39% 40% 41% 42% 43% 44%
## 83.46 84.46 85.52 86.71 87.97 89.47 90.99 92.80 94.98
## 45% 46% 47% 48% 49% 50% 51% 52% 53%
## 98.02 103.00 110.53 112.63 113.86 114.81 115.69 116.59 117.50
## 54% 55% 56% 57% 58% 59% 60% 61% 62%
## 118.75 120.43 123.04 126.36 130.20 135.45 141.72 144.29 146.28
## 63% 64% 65% 66% 67% 68% 69% 70% 71%
## 148.14 151.06 156.45 159.09 160.63 161.88 162.83 163.74 164.85
## 72% 73% 74% 75% 76% 77% 78% 79% 80%
## 166.03 167.28 168.83 170.84 172.95 175.61 178.47 181.69 185.07
## 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 188.65 192.61 195.85 198.55 201.97 206.50 213.26 225.33 238.08
## 90% 91% 92% 93% 94% 95% 96% 97% 98%
## 242.31 245.23 248.55 253.99 263.07 276.53 307.71 324.37 333.91
## 99% 100%
## 354.43 3840.60
quantile(diamonds$volume, 0.999)
## 99.9%
## 431.6
qplot(x = volume, y = price, color = I("orange"), alpha = I(1/20), data = subset(diamonds,
volume > 0 & volume <= quantile(diamonds$volume, 0.999))) + geom_smooth(method = "lm",
se = T, color = "blue")
As the volume increases, the variance in price increases. That is, the data becomes more dispersed. The relationship does not look linear and appears more exponential, especially in the original plot of price vs. volume.
volumePriceLM <- lm(price ~ volume, data = subset(diamonds, volume > 0 & volume <=
quantile(diamonds$volume, 0.999)))
summary(volumePriceLM)
##
## Call:
## lm(formula = price ~ volume, data = subset(diamonds, volume >
## 0 & volume <= quantile(diamonds$volume, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -10739 -809 -12 565 12575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.38e+03 1.29e+01 -185 <2e-16 ***
## volume 4.87e+01 8.59e-02 566 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1510 on 53864 degrees of freedom
## Multiple R-squared: 0.856, Adjusted R-squared: 0.856
## F-statistic: 3.21e+05 on 1 and 53864 DF, p-value: <2e-16
Based on the R2 value, volume explains about 85 percent of the variance in price. Let's keep looking at other variables, including the categorical ones. The volume was a rough approximation though since the diamonds were assumed to be box shaped. Let's use carat and the density of diamonds to better approximate volume.
table(diamonds$carat == 0)
##
## FALSE
## 53940
I'll also have volumes for diamonds that had missing x, y, or z dimensions since carat weight was reported for all diamonds.
1 carat is equivalent to 2 grams
Let's use the average density of diamonds to compute the volume. Using Google, I found that diamond density is typically 3.15-3.53 g/cm3 with pure diamonds having a density close to 3.52 g/cm3. I'm going to use the average density 3.34 g/cm3 to estimate the volume of the diamonds.
# create a volume from carat and density of diamonds
diamonds <- transform(diamonds, volume = carat * 2 * 3.34)
qplot(x = volume, y = price, data = diamonds)
No volumes that are 0. Still have some outliers, but they are less extreme.
qplot(y = price, x = volume, data = diamonds, color = clarity) + scale_color_brewer(type = "div") +
scale_y_log10()
# There's the same outliers as before. Look at diamonds with volumes < 1000
# and > 0.
qplot(y = price, x = volume, data = subset(diamonds, volume < 1000 & volume >
0), color = clarity) + scale_color_brewer(type = "div") + scale_y_log10()
Diamonds are priced higher for better clarity holding volume constant.
qplot(y = price, x = volume, data = subset(diamonds, volume < 1000 & volume >
0), color = cut) + scale_color_brewer(type = "div") + scale_y_log10()
We lose the pattern when coloring by cut.
qplot(y = price, x = volume, data = subset(diamonds, volume < 1000 & volume >
0), color = color) + scale_color_brewer(type = "div") + scale_y_log10()
Diamonds with better color tend to be priced higher holding volume constant. This trend is not as clear or stong when the points were colored by clarity, but the trend is still present.
The relationship between price and carat seemed exponential. Let's use the cube root of carat weight to further explore the relationship of price and carat. This seems reasonable given that carat weight is a function of volume and the volume is a function of a diamond's dimensions.
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:memisc':
##
## percent
cuberoot_trans = function() trans_new("cuberoot", transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) + geom_point() + scale_x_continuous(trans = cuberoot_trans(),
limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(),
limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat")
## Warning: Removed 1683 rows containing missing values (geom_point).
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5,
size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Clarity",
reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(),
limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(),
limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Clarity")
## Warning: Removed 1691 rows containing missing values (geom_point).
Holding carat weight constant, diamonds with lower clarity are almost always cheaper than diamonds with better clarity (worst clarity is I1 and best clarity is IF).
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + geom_point(alpha = 0.5,
size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Cut",
reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(),
limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(),
limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Cut")
## Warning: Removed 1691 rows containing missing values (geom_point).
Price does not vary as much on cut holding carat constant; the pattern is not noticeable here.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) + geom_point(alpha = 0.5,
size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Color",
override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(),
limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(),
limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Color")
## Warning: Removed 1691 rows containing missing values (geom_point).
Color does seem to explain some of the variance in price as was the case with the clarity variable.
The last 3 plots suggest that we can build a linear model and use those variables in the linear model to predict the price of a diamond.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~. + carat)
m3 <- update(m2, ~. + clarity)
m4 <- update(m3, ~. + cut)
m5 <- update(m4, ~. + color)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity,
## data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity +
## cut, data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity +
## cut + color, data = diamonds)
##
## ======================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.464*** 0.391*** 0.415***
## (0.006) (0.019) (0.014) (0.014) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 9.319*** 9.376*** 9.144***
## (0.007) (0.032) (0.023) (0.023) (0.016)
## carat -1.137*** -1.260*** -1.274*** -1.093***
## (0.012) (0.008) (0.008) (0.006)
## clarity: .L 0.889*** 0.854*** 0.907***
## (0.005) (0.005) (0.003)
## clarity: .Q -0.255*** -0.239*** -0.240***
## (0.005) (0.005) (0.003)
## clarity: .C 0.143*** 0.129*** 0.131***
## (0.004) (0.004) (0.003)
## clarity: ^4 -0.086*** -0.080*** -0.063***
## (0.003) (0.003) (0.002)
## clarity: ^5 0.038*** 0.034*** 0.026***
## (0.003) (0.003) (0.002)
## clarity: ^6 0.001 0.004 -0.002
## (0.002) (0.002) (0.002)
## clarity: ^7 0.054*** 0.051*** 0.032***
## (0.002) (0.002) (0.001)
## cut: .L 0.125*** 0.120***
## (0.003) (0.002)
## cut: .Q -0.034*** -0.031***
## (0.003) (0.002)
## cut: .C 0.016*** 0.014***
## (0.002) (0.002)
## cut: ^4 -0.001 -0.002
## (0.002) (0.001)
## color: .L -0.441***
## (0.002)
## color: .Q -0.093***
## (0.002)
## color: .C -0.013***
## (0.002)
## color: ^4 0.012***
## (0.002)
## color: ^5 -0.003*
## (0.001)
## color: ^6 0.001
## (0.001)
## ----------------------------------------------------------------------
## R-squared 0.924 0.935 0.967 0.968 0.984
## adj. R-squared 0.924 0.935 0.967 0.968 0.984
## sigma 0.280 0.259 0.185 0.181 0.129
## F 652012.063 387489.366 175093.345 125821.403 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 14605.945 15580.358 34091.272
## Deviance 4242.831 3613.360 1837.549 1772.344 892.214
## AIC 15930.999 7270.637 -29189.890 -31130.717 -68140.544
## BIC 15957.685 7306.220 -29092.038 -30997.282 -67953.736
## N 53940 53940 53940 53940 53940
## ======================================================================
The variables in this linear model can account for 98.4% of the variance in the price of diamonds. The addition of the cut variable to the model slightly improves the R2 value by one tenth of a percent, which is expected based on the visualization above of Log10 Price vs. Cube-Root Carat and Cut.
qplot(price, data = diamonds, binwidth = 0.01, fill = I("#099DD9")) + scale_x_log10(breaks = c(1000,
1500, 10000)) + ggtitle("Log10 Price") + xlab("Price (in dollars)") + ylab("Number of Diamonds")
## Warning: position_stack requires constant width: output may be incorrect
The distribution of diamond prices appears to be bimodal, perhaps due to the demand of diamonds and buyers purchasing in two different ranges of price points.
library(gridExtra)
## Loading required package: grid
plot1 <- qplot(x = clarity, y = price, data = diamonds, geom = "boxplot", fill = clarity) +
ggtitle("Diamond Prices by Cut") + xlab("Clarity") + ylab("Price (in dollars)") +
coord_cartesian(ylim = c(0, 7000)) + theme(plot.title = element_text(size = 16),
legend.position = "none")
plot2 <- qplot(price, data = diamonds, binwidth = 0.01, color = clarity, geom = "density") +
scale_x_log10(breaks = c(1000, 2000, 3000, 4500, 7000, 10000)) + guides(color = guide_legend(reverse = T)) +
xlab("Price/Carat ($/ct)") + ylab("Density") + ggtitle("Density of Price/Carat by Clarity") +
theme(plot.title = element_text(size = 16))
grid.arrange(plot1, plot2, ncol = 1)
Diamonds with the best level of clarity (IF) have the lowest median price. A greater proportion of diamonds with the best clarity are priced lowered compared to the proportion of diamonds in other price distributions for worse levels of clarity. Price variance increases as the clarity improves (worst clarity is I1).
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5,
size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Clarity",
reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(),
limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(),
limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Clarity") +
theme(plot.title = element_text(size = 16))
## Warning: Removed 1692 rows containing missing values (geom_point).
Holding carat weight constant, diamonds with higher clarity levels (I1 is worst and IF is best) are almost always cheaper than diamonds with better clarity. The plot indicates that a linear model could be constructed to predict the price of variables using log10(price) as the outcome variable and cube-root of carat as the predictor variable
The diamonds data set contains information on almost 54,000 thousand diamonds from around 2008. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the price of diamonds across many variables and created a linear model to predict diamond prices. I was surprised that depth or table did not have a strong positive correlation with price, but these variables are likely to be represented by categorical variables: color, cut, and clarity. I struggled understanding the decrease in median price as the level of cut and clarity improved, but this became more clear when I realized that most of the data contained ideal cut diamonds. For the linear model, all diamonds were included since information on price, carat, color, clarity, and cut were available for all the diamonds. Some limitations of this model include the source of the data. Given that the diamonds date to 2008, the model would likely undervalue diamonds in the market today, either due to changes in demand and supply or inflation rates. To investigat this data further, I would examine how values of 0 were introduced into the data set for the variables volume, x, y, and z. I would be interested in testing the linear model to predict current diamond prices and to determine to what extent the model is accurate at pricing diamonds. A more recent data would be better to make predictions of diamond prices, and comparisons might be made between the other linear models to see if other variables may account for diamond prices.